-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: experimental coalesced read to shared memory before dequantization #10999
Conversation
small stuff scale cache there we go float type
Turns out the CI is failing as llvmpipe isn't happy with this change, though strangely enough it provides good output and tests clean with RADV. GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it. |
On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master and commit 158ab15 is maybe half a percent faster than master.
Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it. IIRC the syntax is just:
and then data in each block can potentially alias. |
This might actually be fine considering how I've forced it to use a subgroup size of 64. It should go faster with 32, but then again if you're memory bound this may not help much.
Thanks. I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU. ---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index fab4ff5f..dfcae382 100644
@@ -1,6 +1,7 @@
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shared_memory_block : require
#include "mul_mat_vec_base.comp"
@@ -9,7 +10,9 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
layout (constant_id = 0) const uint BLOCK_SIZE = 32;
layout (constant_id = 1) const uint NUM_ROWS = 1;
-shared FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+shared shblk {
+ FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+};
void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
uint a_offset, b_offset, d_offset; |
I looked at the SPIR-V that gets generated and I think what's happening is the ArrayStride for the inner array of tmpsh is being set to 128. This is a known limitation of SPIR-V where structure layouts that depend on spec constants don't really work because the layout doesn't get updated based on the spec constant. If you had Keep in mind that GL_EXT_shared_memory_block is probably not supported everywhere, so we'd need to have a fallback path. So need to decide whether the perf we get from it is worth the additional maintenance cost. |
We would definitely need a fallback path. I think all modern desktop devices support the required extension Edit: You can check how broad the support is on https://vulkan.gpuinfo.org/listextensions.php |
In that case it's definitely not worth it unless it's a huge improvement. Actually regular layout aliasing is working well for me and I'm currently getting 24.6 t/s using a modified I also tried the same caching method with the B vector and it made things much slower. Now if subgroup operators actually worked properly (wink wink Intel) it would also be worth testing subgroup shuffles to see how much faster it is compared to shared memory. |
We could give that another try, the issue might have been subgroups that were not the right size or not fully enabled. We now support the extension to enforce full subgroups. |
Surprisingly enough subgroup shuffles ended up being a couple percent slower than shared memory, both for my Q6_K experiment and applied to the IQ4_NL LUT. Considering how close the IQ4_NL and Q4_0 results are it's safe to say that the on chip shared memory is really fast, but I didn't expect it to be faster than a direct register shuffle. Anyways I'm closing this PR as I'll be busy with work starting next week and the performance improvements here just aren't big enough to be worth all this effort. I'll deliver up the non shared memory related fixes for Q6_K in a separate PR. |
This is an ugly and hacky experiment that's kept as a draft for a reason, but it makes Q6_K run almost as fast as Q4_K_M. Basically the idea is to first read the superblocks into shared memory using perfectly coalesced loads and then do the actual dequantizations using the fast shared memory. As this completely separates the memory reads from the dequantization code this also allows us to explore different algorithms for dequantization that don't rely on coalescing.
With any other language I'd be able to easily do this, but GLSL has no pointers, memcpy, and all that. For some reason though it lets me index an array of structs beyond it's stated size to access the next element, and that doesn't cause an error or crash my computer. As a result this is not suitable for release unless we have a way of doing this properly, and I'm just leaving this here for feedback and suggestions.
It's possible to make this go faster by using 32-bit loads per thread if the alignment allows for it, but again it's the language that's holding me back. The current code runs with a fixed 64 subgroup size.
In 158ab15 I also have a proper version of this which runs at 23 t/s with the same Q6_K model. Here the loads are coalesced for each 16 thread superblock only and no hacks are required, but I think it only helps with AMD GCN as it's 64 wide SIMD is actually 4 16 wide units internally (and the loads are done in blocks of 16 threads).